An introduction to resampling methods
autosize: false
font-family: 'Gill Sans'
transition: none
Reference: Data Science Chapter 5
From the New England Journal of Medicine in 2006:
We randomly assigned patients with resectable adenocarcinoma of the stomach, esophagogastric junction, or lower esophagus to either perioperative chemotherapy and surgery (250 patients) or surgery alone (253 patients)…. With a median follow-up of four years, 149 patients in the perioperative-chemotherapy group and 170 in the surgery group had died. As compared with the surgery group, the perioperative-chemotherapy group had a higher likelihood of overall survival (five-year survival rate, 36 percent vs. 23 percent).
Conclusion: - Chemotherapy patients are 13% more likely to survive past 5 years.
Not so fast! In statistics, we ask “what if?” a lot:
- What if the randomization of patients just happened, by chance, to assign more of the healthier patients to the chemo group?
- Or what if the physicians running the trial had enrolled a different sample of patients from the same clinical population?
Conclusion: - Chemotherapy patients are 13% more likely to survive past 5 years.
Always remember two basic facts about samples:
- All numbers are wrong: any quantity derived from a sample is just a guess of the corresponding population-level quantity.
- A guess is useless without an error bar: an estimate of how wrong we expect the guess to be.
Conclusion: - Chemotherapy patients are 13% \(\pm\) ? more likely to survive past 5 years, with ??% confidence.
By “quantifying uncertainty,” we mean filling in the blanks.
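One standard way to fill in those blanks (shown here as a hedged sketch, not the trial's actual analysis) is the normal-approximation interval for a difference of two proportions, using the survival rates and group sizes reported above:

```python
import math

# Reported five-year survival rates and group sizes from the trial.
p_chemo, n_chemo = 0.36, 250
p_surg, n_surg = 0.23, 253

# Point estimate: difference in survival proportions.
diff = p_chemo - p_surg  # 0.13

# Standard error of a difference between independent proportions.
se = math.sqrt(p_chemo * (1 - p_chemo) / n_chemo
               + p_surg * (1 - p_surg) / n_surg)

# Approximate 95% confidence interval (normal approximation).
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"{diff:.2f} +/- {1.96 * se:.2f}  -> ({lo:.2f}, {hi:.2f})")
```

The 1.96 multiplier gives roughly 95% coverage under the normal approximation; the resampling methods covered later arrive at similar error bars without relying on a closed-form formula.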
In stats, we equate trustworthiness with stability:
- If our data had been different merely due to chance, would our answer have been different, too?
- Or would the answer have been stable, even with different data?
\[ \begin{array}{r} \mbox{Confidence in} \\ \mbox{your estimates} \\ \end{array} \iff \begin{array}{l} \mbox{Stability of those estimates} \\ \mbox{under the influence of chance} \\ \end{array} \]
For example:
- If doctors had taken a different sample of 503 cancer patients and gotten a drastically different estimate of the new treatment’s effect, then the original estimate isn’t very trustworthy.
- If, on the other hand, pretty much any sample of 503 patients would have led to the same estimates, then their answer for this particular subset of 503 is probably accurate.
Let’s work through a thought experiment…
Imagine Andrey Kolmogorov on a four-day fishing trip.
- The lake is home to a very large population of fish of varying size and weight.
- On each day, Kolmogorov takes a random sample of size \(N=15\) from this population—that is, he catches (and releases) 15 fish.
- He records the weight and approximate volume of each fish.
- He uses each day’s catch to compute a different estimate of the volume–weight relationship for all fish in the lake.
At right we see the sampling distribution for both \(\beta_0\) (the intercept) and \(\beta_1\) (the slope) of the fitted volume–weight line.
- Each is centered on the true population value.
- The spread of each histogram tells us how variable our estimates are from one sample to the next.
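The thought experiment is easy to run as a simulation. In this sketch the population line and noise level (\(\beta_0 = 1\), \(\beta_1 = 3\), standard-normal errors) are invented for illustration: each "day" draws \(N = 15\) fish, fits a least-squares line, and we collect the estimates.

```python
import random
import statistics

random.seed(0)

BETA0, BETA1 = 1.0, 3.0   # invented "true" population values
N = 15                    # fish per day, as in the thought experiment

def one_sample():
    """One day's catch: weights x, volumes y = b0 + b1*x + noise."""
    x = [random.uniform(1, 5) for _ in range(N)]
    y = [BETA0 + BETA1 * xi + random.gauss(0, 1) for xi in x]
    return x, y

def ols_fit(x, y):
    """Least-squares intercept and slope of y on x."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

# Repeat the "fishing trip" many times to trace the sampling distribution.
draws = [ols_fit(*one_sample()) for _ in range(2000)]
b0s, b1s = zip(*draws)
print(f"mean b1 = {statistics.fmean(b1s):.2f}, sd = {statistics.stdev(b1s):.2f}")
```

Histograms of `b0s` and `b1s` reproduce the picture described above: each centered on its true value, with a spread that shrinks as \(N\) grows.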
Suppose we are trying to estimate some population-level quantity \(\theta\): the parameter of interest.
So we take a sample from the population: \(X_1, X_2, \ldots, X_N\).
We use the data to form an estimate \(\hat{\theta}_N\) of the parameter.
Now imagine repeating this process thousands of times!
Estimator: any method for estimating the value of a parameter (e.g. the sample mean, the sample proportion, the slope of an OLS line).
Sampling distribution: the probability distribution of an estimator \(\hat{\theta}_N\) under repeated samples of size \(N\).
Bias: Let \(\bar{\theta}_N = E(\hat{\theta}_N)\) be the mean of the sampling distribution. The bias of \(\hat{\theta}_N\) is \((\bar{\theta}_N - \theta)\): the difference between the average answer and the truth.
Unbiased estimator: one for which \((\bar{\theta}_N - \theta) = 0\), i.e. the sampling distribution is centered on the truth.
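A classic concrete case of bias: the plug-in variance estimator that divides by \(N\) satisfies \(E(\hat{\sigma}^2_N) = \frac{N-1}{N}\sigma^2 < \sigma^2\), while dividing by \(N-1\) removes the bias. A quick simulation sketch (population values invented for illustration):

```python
import random
import statistics

random.seed(1)

SIGMA2 = 4.0   # true population variance (normal draws with sd = 2)
N = 10

def plug_in_var(xs):
    """Divide by N: biased, with E[...] = (N-1)/N * sigma^2."""
    m = statistics.fmean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

biased, unbiased = [], []
for _ in range(20000):
    xs = [random.gauss(0, 2) for _ in range(N)]
    biased.append(plug_in_var(xs))
    unbiased.append(statistics.variance(xs))  # divides by N - 1

print(f"mean of biased estimator:   {statistics.fmean(biased):.2f}")   # ~ 0.9 * 4 = 3.6
print(f"mean of unbiased estimator: {statistics.fmean(unbiased):.2f}") # ~ 4.0
```

The average of the biased estimator settles near \(3.6\), not \(4\): a systematic shortfall that no amount of repeated sampling washes out.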
Standard error: the standard deviation of an estimator’s sampling distribution.
\[ \begin{aligned} \mbox{se}(\hat{\theta}_N) &= \sqrt{ \mbox{var}(\hat{\theta}_N) } \\ &= \sqrt{ E[ (\hat{\theta}_N - \bar{\theta}_N )^2] } \\ &= \mbox{Typical deviation of $\hat{\theta}_N$ from its average} \end{aligned} \]
“If I were to take repeated samples from the population and use this estimator for every sample, how much does the answer vary, on average?”
If an estimator is unbiased, then
\[ \begin{aligned} \mbox{se}(\hat{\theta}_N) &= \sqrt{ E[ (\hat{\theta}_N - \bar{\theta}_N )^2] } \\ &= \sqrt{ E[ (\hat{\theta}_N - \theta )^2] } \\ &= \mbox{Typical deviation of $\hat{\theta}_N$ from the truth} \end{aligned} \]
“If I were to take repeated samples from the population and use this estimator for every sample, how big of an error do I make, on average?”
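For the sample mean this typical error has a known closed form, \(\mbox{se}(\bar{X}_N) = \sigma/\sqrt{N}\), which gives a handy check on the repeated-sampling picture. A minimal simulation sketch (population values invented for illustration):

```python
import random
import statistics

random.seed(2)

SIGMA, N = 2.0, 25

# Sampling distribution of the sample mean, traced by brute force.
means = [statistics.fmean(random.gauss(0, SIGMA) for _ in range(N))
         for _ in range(20000)]

se_simulated = statistics.stdev(means)  # sd of the sampling distribution
se_theory = SIGMA / N ** 0.5            # sigma / sqrt(N) = 0.4

print(f"simulated se: {se_simulated:.3f}, theoretical se: {se_theory:.3f}")
```

The two numbers agree closely. Resampling methods exploit exactly this idea when no formula like \(\sigma/\sqrt{N}\) is available: approximate the sampling distribution by drawing repeatedly, then read off its standard deviation.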